Abstract

The Shannon (block) entropy of the clusters generated by intersecting a long-range correlated 
sequence with its moving average is studied. The entropy is given by two terms, respectively 
increasing logarithmically and linearly (besides a constant term), corresponding to clusters with 
power-law or exponentially distributed lengths. Then, the entropy measure is implemented on 
the 24 human chromosome sequences. Interestingly, it is found that, for the power-law correlated 
clusters, the nucleotide composition is, on the average, equal to the nucleotide composition of the 
whole sequence, while, for the exponentially correlated clusters, it fluctuates around the average 
value. Even more interestingly, it is found that the variance of the fluctuations is a characteristic 
property of each chromosome. How these fluctuations correlate to biological properties such as 
segmental duplications and gene density of each chromosome is finally discussed. 
I. INTRODUCTION 



Complex systems are probed by observing a relevant quantity over a certain temporal or 
spatial range, yielding long-range correlated sequences or arrays, with the remarkable feature 
of displaying 'ordered' patterns, which emerge from the seemingly random structure. The 
degree of 'order' is intrinsically linked to the information embedded in the patterns, whose 
extraction and quantification might add clues to many complex phenomena 1KZ]. 

In this work, we put forward an information measure for long-range correlated sequences, 
worked out from a partition of the sequence into words or blocks (here called clusters) 
according to [6]. The clusters are characterized by their length $\ell$, duration $\tau$ and area $A$, 
obeying power-law probability distributions, with a cross-over to an exponential decay at 
large size. Further information measures based on this partition approach and extension to 
higher dimensions might be envisaged for future development 0, S] • 

The probability distribution function of the lengths is considered to estimate the Shannon 
(block) entropy of the clusters. It is shown that the entropy can be written as the sum of three 
terms: a constant, a logarithmic and a linear function of the cluster length. The clusters 
with dominant logarithmic term of the entropy are power-law correlated and correspond to 
'ordered' structures, while those with dominant linear term are exponentially distributed 
and correspond to 'disordered' structures. 

The information measure is illustrated by analyzing the 24 nucleotide sequences of the 
human chromosomes Q|. Each sequence is first mapped to a fractional Brownian walk (the 
so-called DNA walk). Then, the probability distribution function $P(\ell)$ and the entropy 
$S(\ell)$ of the DNA clusters are estimated by adopting the proposed approach. It is 
shown that the power-law correlated clusters are characterized by a nucleotide content (in 
terms of purine-pyrimidine pairs (G+C)% and (A+T)%) on the average equal to the value 
of the whole chromosome sequence. Conversely, the exponentially correlated clusters are 
characterized by a percentage of purine-pyrimidine pairs exhibiting fluctuations around the 
value of the whole chromosome sequence. Interestingly, it is shown that the variance $\sigma_C$ 
of the cluster composition fluctuations for each of the 24 chromosomes is correlated to 
biologically relevant properties, such as duplication frequency and gene density. It should 
be pointed out that the entropy of a sequence, coded in blocks or words, has been extensively 
studied since its introduction by Shannon (see and Refs. therein). The novelty of the 
present approach resides in the method used for partitioning the sequence into words, which 
provides the set of inherently correlated ones in a straightforward manner. 



II. METHOD 

A random sequence $y(x)$ can be partitioned into elementary clusters by considering its 
intersection with the moving average $y_n(x)$, where $n$ is the size of the moving window. The 
clusters correspond to the regions bounded by $y(x)$ and $y_n(x)$ between two subsequent 
crossing points $x_c(i)$ and $x_c(i+1)$ [6]. The intersection between $y(x)$ and $y_n(x)$ produces 
a generating partition, yielding different sequences of clusters for different values of $n$. 
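
The partition described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in this work: it assumes a trailing (backward) moving average over $n$ points and resolves exact ties between $y(x)$ and $y_n(x)$ by keeping the previous side, neither of which is specified in the text. The function names `moving_average`, `crossing_points` and `cluster_lengths` are hypothetical.

```python
import numpy as np

def moving_average(y, n):
    """Trailing moving average over n points (assumed convention); defined for x >= n - 1."""
    y = np.asarray(y, dtype=float)
    csum = np.cumsum(np.insert(y, 0, 0.0))
    return (csum[n:] - csum[:-n]) / n

def crossing_points(y, n):
    """Positions x_c(i) at which y(x) moves to the other side of its moving average."""
    y = np.asarray(y, dtype=float)
    sign = np.sign(y[n - 1:] - moving_average(y, n))
    # Exact ties inherit the previous side (an arbitrary convention, assumed here).
    for i in range(1, len(sign)):
        if sign[i] == 0:
            sign[i] = sign[i - 1]
    changes = np.flatnonzero(sign[1:] != sign[:-1]) + 1
    # Shift back to the coordinates of the original sequence.
    return changes + (n - 1)

def cluster_lengths(y, n):
    """Lengths of the clusters between consecutive crossing points."""
    return np.diff(crossing_points(y, n))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    walk = np.cumsum(rng.choice([-1, 1], size=10_000))
    print(cluster_lengths(walk, n=50)[:10])
```

Running the function with several values of `n` on the same walk yields a different sequence of clusters for each window, as stated above.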

The probability distribution function $P(\ell,n)$ of the lengths $\ell$ for each $n$ can be obtained 
by counting the clusters $\mathcal{N}(\ell_1,n), \mathcal{N}(\ell_2,n), \ldots, \mathcal{N}(\ell_i,n)$ with lengths $\ell_1, \ell_2, \ldots, \ell_i$, respectively. 
By doing so, one obtains: 

$$P(\ell,n) \sim \ell^{-D}\,\mathcal{F}(\ell,n) \sim \mu(\ell,n)^{-1} , \qquad (1)$$

where $D = 2 - H$ is the fractal dimension of the sequence and the function $\mathcal{F}(\ell,n)$ can be 
taken of the form: 

$$\mathcal{F}(\ell,n) = \exp(-\ell/n) . \qquad (2)$$

$\mathcal{F}(\ell,n)$ accounts for the drop-off of $P(\ell,n)$ due to the finiteness of $n$ when $\ell \gg n$. The quantity 
$\mu(\ell,n) = \ell^{D}\exp(\ell/n)$ corresponds to the size of the cluster lengths spanned by the random 
walkers. $\mu(\ell,n)$ ranges from a line proportional to $\ell$ for $H = 1$ to a square proportional to $\ell^{2}$ 
for $H = 0$. The probability distribution function $P(\ell,n)$ is shown in Fig. 1 for 
a wide range of $n$ values, estimated for a long-range correlated series with Hurst exponent 
$H \approx 0.6$. For $n \to 1$, the lengths $\ell$ of the elementary clusters are centered around a single 
value. When $n$ increases, a broader range of lengths is obtained and, consequently, $P(\ell,n)$ 
spreads over all values. 
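
As an illustration of the counting procedure, the following sketch estimates the empirical $P(\ell,n)$ from a list of cluster lengths and evaluates the model form of Eqs. (1) and (2) on a finite grid of lengths. The function names `length_distribution` and `model_pdf` are hypothetical, and normalizing the model over the chosen grid is an assumption made here only for convenience.

```python
import numpy as np

def length_distribution(lengths):
    """Empirical P(l, n): fraction of clusters observed with each length l."""
    lengths = np.asarray(lengths, dtype=int)
    values, counts = np.unique(lengths, return_counts=True)
    return values, counts / counts.sum()

def model_pdf(ell, n, H):
    """Model P(l, n) ~ l**(-D) * exp(-l / n) with D = 2 - H, normalized on the grid ell."""
    ell = np.asarray(ell, dtype=float)
    D = 2.0 - H
    weights = ell ** (-D) * np.exp(-ell / n)
    return weights / weights.sum()

if __name__ == "__main__":
    values, p = length_distribution([1, 1, 2, 4])
    print(values, p)
    print(model_pdf(np.arange(1, 6), n=10, H=0.6))
```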


The Shannon (block) entropy is defined as [2-4]: 

$$S(\ell,n) = -\sum P(\ell,n)\log P(\ell,n) , \qquad (3)$$

where the sum is performed over all the elementary clusters. By using Eq. (1) and Eq. (3), 
the cluster entropy can be written as: 

$$S(\ell,n) = S_0 + \log \ell^{D} - \log\mathcal{F}(\ell,n) , \qquad (4)$$

which, after taking into account Eq. (2), becomes: 

$$S(\ell,n) = S_0 + \log \ell^{D} + \frac{\ell}{n} , \qquad (5)$$

where $S_0$ accounts for the constant terms, $\log \ell^{D}$ is related to the term $\ell^{-D}$ and $\ell/n$ is related 
to the term $\mathcal{F}(\ell,n)$. 
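
The origin of the three terms can be made explicit with a short worked step. This is a sketch under two assumptions that are not spelled out in the text: the proportionality in Eq. (1) is written with a normalization constant $C$ (a symbol introduced here only for illustration), and the entropy contribution of the clusters of length $\ell$ is read as $-\log P(\ell,n)$.

```latex
P(\ell,n) = C\,\ell^{-D}\,e^{-\ell/n}
\quad\Longrightarrow\quad
-\log P(\ell,n)
  = \underbrace{-\log C}_{\text{constant}}
  + \underbrace{D\log\ell}_{\log \ell^{D}}
  + \underbrace{\frac{\ell}{n}}_{-\log\mathcal{F}(\ell,n)} .
```

Under these assumptions the constant $-\log C$ is absorbed into $S_0$, the power-law factor $\ell^{-D}$ produces the logarithmic term and the exponential cut-off $\exp(-\ell/n)$ produces the linear term.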

To clarify the meaning of the terms appearing in Eq. (5), it is worth remarking that, for 
isolated systems, the entropy increase is related to the irreversible processes spontaneously 
occurring within the system. It tends to a constant value as a stationary state is asymptotically 
reached ($dS > 0$). For open systems interacting with their environment, the entropy 
increase is given by a term $dS_i > 0$, due to the irreversible processes spontaneously occurring 
within the system, and a term $dS_e$, due to the irreversible processes arising through external 
interactions. The term $\log \ell^{D}$ in Eq. (5) should be interpreted as an intrinsic entropy $S_i$. It is 
independent of the method of partitioning the sequence (which plays the role of the external 
interaction) and, in particular, of $n$. The logarithmic term has the form of the Boltzmann 
entropy $S = \log\Omega$, where $\Omega$ is the maximum volume occupied by the isolated system. The 
quantity $\ell^{D}$ corresponds to the volume occupied by the random walker. Whenever $\ell$ could 
reach the maximum size $L$ of the sequence, the second term on the right side becomes $\log L^{D}$. 
The term $\ell/n$ represents the excess entropy $S_e$ introduced by the partition process. It comes 
into play when the long-range correlated sequence is partitioned into clusters and depends on 
$n$. 

Fig. 2 shows the entropy $S(\ell,n)$ evaluated by using the probability distribution $P(\ell,n)$ 
plotted in Fig. 1. One can note that $S(\ell,n)$ increases logarithmically at small $\ell$, while it 
increases as a linear function at larger $\ell$, as expected according to Eq. (5). In particular, 
$S(\ell,n)$ behaves as $\log \ell^{D}$ and is $n$-invariant for small values of $\ell$. Clusters with lengths $\ell$ 
greater than $n$ are not power-law correlated, due to the finite-size effects introduced by the 
window $n$. They are characterized by a value of the entropy exceeding the values lying on 
the curve $\log \ell^{D}$. Clusters with a given length $\ell$ can be generated by different values of the 
window $n$. For example, clusters with $\ell = 2500$ have entropies corresponding to the point A 
(for $n = 1000$) or A" (for $n = 3000$ and $n = 10000$), as shown in Fig. 2. One can observe that 
A" corresponds to power-law correlated (i.e. ordered) clusters, since A" lies on the curve 
$\log \ell^{D}$. Conversely, the point A does not correspond to power-law correlated clusters, since A 
lies on the curve $\ell/n$, which originates from the term $\mathcal{F}(\ell,n)$. In other words, clusters with 
lengths shorter than $n$ are ordered (long-range correlated), whereas clusters with lengths 
larger than $n$ are disordered (exponentially correlated). 

To gain further insight into the meaning of the terms appearing in Eq. (5), the source 
entropy rate $s$ is calculated for the entropy $S(\ell,n)$. The source entropy rate is a measure 
of the excess randomness of the coding algorithm. Noisy coding processes are characterized 
by large values of $s$ [2]. By using the definition and Eq. (5), the source entropy rate can be written as: 

$$s \equiv \lim_{\ell\to\infty} \frac{S(\ell,n)}{\ell} = \frac{1}{n} . \qquad (6)$$

The excess randomness of the clusters is inversely proportional to $n$ and, thus, becomes 
negligible in the limit $n \to \infty$. This can be clearly observed in the curves of Fig. 2, where 
one can note that higher entropy rates correspond to steeper slopes of the linear term $\ell/n$. 
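
The limit in Eq. (6) can be checked numerically from Eq. (5). The sketch below is illustrative only: it assumes natural logarithms and $S_0 = 0$, and the function names `cluster_entropy` and `entropy_rate` are hypothetical.

```python
import math

def cluster_entropy(ell, n, D, S0=0.0):
    """Eq. (5): S = S0 + log(ell**D) + ell / n (natural logarithm assumed)."""
    return S0 + D * math.log(ell) + ell / n

def entropy_rate(ell, n, D, S0=0.0):
    """Finite-length estimate S(ell, n) / ell of the source entropy rate."""
    return cluster_entropy(ell, n, D, S0) / ell

if __name__ == "__main__":
    D = 2 - 0.6  # fractal dimension for H = 0.6
    for n in (1000, 3000, 10000):
        # The estimate approaches 1 / n as ell grows.
        print(n, entropy_rate(1e9, n, D), 1 / n)
```

The logarithmic and constant terms contribute $(S_0 + D\log\ell)/\ell$, which vanishes for large $\ell$, so only the slope $1/n$ of the linear term survives.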



A. Application: the 24 Human Chromosome Sequences. 

The publication of several completely sequenced genomes from many species provides a 
unique opportunity to develop suitable statistical methods and yield benchmarks for the 
identification of biological features [9-29]. 


The two strands of DNA are held together by hydrogen bonds between complementary 
bases: two bonds for the AT pair and three bonds for the GC pair, which is therefore 
stronger. GC-rich and GC-poor segments may play different roles in 
biological processes such as duplication, segmentation and unzipping. For example, the 
distributions of genes and repetitive elements were found to be associated with particular GC 
contents. Small and medium sized genes were found to be more abundant in GC-rich regions 
of human genome, whereas long genes (typically genes with long introns) were found to be 
scarce in GC rich regions. 

Nonuniformity of nucleotide composition within genomes was revealed several decades 
ago by thermal melting and gradient centrifugation. On the basis of findings concerning 
buoyant densities of melted DNA fragments, a theory for the structure of genomes of 
warm-blooded vertebrates, known as the isochores theory, was put forward. Isochores were defined 
as long genomic segments that are fairly homogeneous in their guanine and cytosine (GC) 
composition. Though it is widely accepted that the human genome contains large regions 
of distinctive GC content, the availability of fully sequenced DNA or RNA molecules allows 
one to accurately calculate the local properties by statistical methods. This calls for the development of 
efficient algorithms achieving a deep and accurate description of genomes. 

In this section, the information metrics described above are implemented on genomic 
sequences. The main computational steps are described below. 

The sequence of the nucleotide bases ATGC (FASTA file) is mapped according to the 
following rule: if the base is a purine (A, G), the base is mapped to +1; otherwise, if the base 
is a pyrimidine (C, T), the base is mapped to -1 (Fig. 3). The sequence of +1 and -1 is 
summed and a random walk $y(x)$ (DNA walk) is obtained (Fig. 3). 
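
The purine-pyrimidine mapping can be sketched as follows. This is a minimal illustration rather than the code used in this work; the function name `dna_walk` is hypothetical, and skipping any symbol other than A, C, G, T (for example the N placeholders found in FASTA files) is an assumption, since the text does not say how such symbols are handled.

```python
import numpy as np

# Purines (A, G) map to +1, pyrimidines (C, T) map to -1.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}

def dna_walk(sequence):
    """Cumulative sum of the +1/-1 steps; symbols other than A, C, G, T are skipped (assumed)."""
    steps = np.array([STEP[base] for base in sequence.upper() if base in STEP], dtype=int)
    return np.cumsum(steps)

if __name__ == "__main__":
    print(dna_walk("AGCTTA"))
```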

The function $y_n(x)$ is calculated for the DNA walk with different values of the window $n$. 
The intersection between $y(x)$ and $y_n(x)$ yields a set of clusters. Each cluster corresponds 
to a segment between two adjacent intersections of $y(x)$ and $y_n(x)$ at $x_c(i)$ and $x_c(i+1)$. 
Since each cluster of the DNA walk corresponds to a cluster of nucleotides, the number of 
nucleotides ATGC is counted in each cluster. The ATGC count is plotted as a function 
of the cluster length $\ell$. In Figs. 4-7 the nucleotide composition of the clusters is shown 
for the 24 human chromosomes. The nucleotide composition is plotted as a function of the 
cluster length $\ell$ for three values of $n$, namely $n = 2$, $n = 4$ and $n = 10$. However, the 
range of $n$ values used in this work varied from 2 to 10,000. One can observe that the 
nucleotide count is roughly constant for clusters with length comparable to or shorter than 
$n$. This means that 'ordered' DNA clusters, i.e. clusters with constant nucleotide composition, are 
found when the entropy varies as the logarithm of $\ell$. For cluster lengths comparable to or larger 
than $n$, the power-law correlation breaks down with the onset of exponentially correlated 
clusters ('disordered' clusters). 

An even more interesting result of our computation is that 
the amplitude of the fluctuations is not the same for all chromosomes but assumes a characteristic value 
for each chromosome. For example, one can note in Figs. 4-7 that the fluctuations of the 
cluster composition are very small for Chromosomes 8, 9, 17 and Y. Conversely, they are quite 
large for Chromosomes 14, 15 and X. The next step is to understand whether the 
nucleotide fluctuations characterizing each chromosome are linked to biological features. 
To this purpose, the variance $\sigma_C$ of the fluctuations has been calculated for the nucleotide 
composition ATGC of the clusters (values are in Table II). The variances $\sigma_C$ have then 
been correlated to other characteristic features of each chromosome: length, gene density, 
inter-chromosomal duplications, intra-chromosomal duplications and local ATGC composition. 
The correlation coefficients $\rho_C$ are shown in Table III. Negative correlations between $\sigma_C$ 
and intra- and inter-chromosomal duplications are found. Strong positive correlations are 
observed between $\sigma_C$ and AT-rich regions. These results might indicate 
that the cluster fluctuations are a fingerprint of recent segmental duplications. 
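
The composition analysis described in this section can be sketched as follows. The code is illustrative only and rests on stated assumptions: the crossing positions are taken as given, the variance is the population variance (NumPy default, `ddof=0`) because the estimator is not specified in the text, and the correlation is taken to be the Pearson coefficient. The function names `cluster_composition`, `composition_variance` and `pearson` are hypothetical.

```python
import numpy as np

def cluster_composition(sequence, crossings):
    """Percentage of A, C, G, T in each cluster delimited by consecutive crossing positions."""
    rows = []
    for start, stop in zip(crossings[:-1], crossings[1:]):
        segment = sequence[start:stop].upper()
        rows.append([100.0 * segment.count(base) / len(segment) for base in "ACGT"])
    return np.array(rows)

def composition_variance(composition):
    """Variance of the cluster composition for each base (population variance, assumed)."""
    return composition.var(axis=0)

def pearson(x, y):
    """Pearson correlation coefficient between two per-chromosome quantities (assumed estimator)."""
    return float(np.corrcoef(x, y)[0, 1])

if __name__ == "__main__":
    comp = cluster_composition("AACCGGTT", [0, 4, 8])
    print(comp)                       # one row per cluster, columns A, C, G, T
    print(composition_variance(comp))
```

In such a sketch, one variance per base and per chromosome would be collected and then passed to `pearson` together with a chromosome-level feature such as length or gene density.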



Conclusions 



We have further developed an information measure based on the approach put forward in 
Refs. [6, 7] and the concept of Shannon (block) entropy. The proposed approach is particularly 
appealing thanks to the novelty and effectiveness of the partitioning method for generating 
the blocks. The entropy can be written as the sum of two contributions: one increasing logarithmically 
and the other linearly with the lengths of the blocks or words. 

Furthermore, it has been shown that the variance of the clusters with fluctuating composition 
is related to relevant biological quantities like segmental duplications and AT-rich 
regions. The algorithm performs the following tasks: (i) separation of blocks according to 
their length and correlation degree; (ii) separation of DNA segments according to the nucleotide 
composition. Compared to already existing techniques to identify segments with 
certain nucleotide contents, the present method is able to give the requested output even 
with extremely short sequences (as shown in the plots of Figures 4-7). Last but not least, 
the proposed procedure is able to identify DNA segments correlated to important biological 
processes, such as intra- and inter-chromosomal segmental duplications. 
TABLE I: Length L (2nd column), Hurst exponent H (3rd column) and base composition (% of ATCG, 
4th-7th columns) of the 24 chromosome sequences. Base composition (% of ATCG, 8th-11th 
columns) of the first 10 Mbases of the 24 chromosome sequences. 

                       Whole sequence                  First 10 Mbases
CHR  L          H      A [%]  C [%]  G [%]  T [%]      A [%]  C [%]  G [%]  T [%]
1    226217758  0.64   29.09  20.87  20.87  29.14      26.52  25.79  25.58  25.15
2    237900011  0.66   29.84  20.11  20.13  29.90      28.50  24.51  22.34  29.81
3    195304882  0.66   30.14  19.84  19.84  30.16      28.46  21.77  21.50  28.65
4    187941502  0.66   30.87  19.11  19.12  30.88      34.47  19.80  22.86  30.28
5    177847050  0.66   30.20  19.74  19.77  30.27      29.97  24.86  19.70  36.43
6    169100547  0.65   30.18  19.80  19.81  30.19      29.97  21.23  21.73  28.27
7    155403473  0.66   29.60  20.38  20.36  29.63      28.85  21.79  27.09  22.93
8    143332430  0.65   29.90  20.06  20.06  29.86      29.51  20.61  20.85  29.03
9    120994158  0.67   29.35  20.65  20.64  29.33      27.91  21.83  21.38  28.76
10   131739836  0.65   29.19  20.79  20.78  29.22      31.15  19.94  19.47  29.44
11   131247160  0.68   29.20  20.77  20.79  29.21      28.97  22.23  25.08  26.85
12   130304143  0.67   29.59  20.40  20.39  29.60      30.66  22.85  24.19  29.62
13   95747346   0.66   30.69  19.26  19.26  30.77      33.94  20.95  21.82  33.88
14   88290585   0.67   29.44  20.41  20.46  29.67      33.94  20.95  21.82  33.88
15   81927784   0.66   28.89  21.13  21.10  28.86      32.64  21.69  20.74  33.00
16   78990748   0.67   27.53  22.35  22.44  27.66      29.17  24.30  22.89  31.49
17   79620483   0.65   27.17  22.81  22.76  27.22      25.36  24.80  24.73  24.87
18   74660927   0.67   30.09  19.87  19.90  30.12      25.36  24.80  24.73  24.87
19   56038018   0.66   25.79  24.14  24.20  25.86      32.65  21.61  23.15  30.64
20   59505758   0.66   27.76  22.02  22.09  28.10      29.01  24.09  19.02  36.45
21   35452914   0.65   29.68  20.39  20.44  29.46      32.27  19.25  21.18  27.29
22   35059666   0.65   26.08  23.98  23.95  25.96      28.30  22.93  24.63  24.92
X    152580014  0.65   30.20  19.73  19.76  30.26      32.88  20.86  25.39  28.01
Y    25654723   0.72   29.88  19.87  20.08  30.14      27.45  22.05  24.21  26.49

FIG. 1: Probability distribution function $P(\ell,n)$ of cluster lengths for a sequence with $H \approx 0.6$ and 
$L = 2^{20}$. The moving average windows are $n = 500$, $n = 1000$, $n = 2000$, $n = 3000$ and $n = 10000$ 
(from left to right). As $n$ increases, $P(\ell,n)$ becomes broader. The slope of the distribution becomes 
steeper for $\ell > n$, corresponding to the onset of finite-size effects and exponentially decaying 
correlation. 

FIG. 2: Entropy $S(\ell,n)$ of the clusters according to the PDFs $P(\ell,n)$ plotted in Fig. 1. 

TABLE II: Variance $\sigma_C$ of the cluster composition fluctuations plotted in Figures 4-7 (% of the 
ATCG nucleotides) for the 24 chromosome sequences. 

CHR  sigma_C [A]  sigma_C [C]  sigma_C [G]  sigma_C [T]
1    11.01        10.81        10.06        9.68
2    14.19        12.90        12.43        14.30
3    10.05        9.43         8.29         8.52
4    16.56        12.91        13.87        17.79
5    19.75        14.32        12.76        16.69
6    9.07         6.33         7.42         8.71
7    8.58         8.32         11.14        10.21
8    4.89         3.88         4.33         4.91
9    6.49         4.97         4.36         5.23
10   10.84        9.04         7.52         9.38
11   9.12         8.83         11.69        10.87
12   17.09        14.86        14.13        15.63
13   17.12        12.01        13.78        18.11
14   19.53        13.06        13.43        19.26
15   16.55        14.08        12.54        16.61
16   16.31        15.60        15.77        15.69
17   5.06         5.17         4.95         5.67
18   20.02        13.98        14.10        18.47
19   17.90        14.88        13.99        17.10
20   17.31        13.86        14.33        18.70
21   10.84        7.69         8.50         10.81
22   8.40         5.81         5.53         8.48
X    18.06        14.09        14.87        19.06
Y    6.19         6.66         7.08         7.47

TABLE III: Correlation $\rho_C$ of the cluster variances $\sigma_C$ for the first 10 Mbases ($M_1$), the second 
10 Mbases ($M_2$) and the third 10 Mbases ($M_3$) of the 24 human chromosomes. The cluster variance 
$\sigma_C$ is anticorrelated with length, gene density, inter-chromosomal and intra-chromosomal segmental 
duplications. The cluster variance $\sigma_C$ exhibits a positive correlation with AT-rich regions. Very 
little correlation is found with GC-rich regions and global AT composition. 

                                     rho_C [M_1]  rho_C [M_2]  rho_C [M_3]
Length (a)                           -0.194       -0.552       -0.582
Gene density (b)                     -0.178       -0.076       -0.107
Inter-chromosomal duplications       -0.330       -0.242       -0.165
Intra-chromosomal duplications (c)   -0.342       -0.248       -0.158
All pairwise duplications (c)        -0.331       -0.237       -0.149
Local composition A (d)              +0.658       +0.762       +0.461
Local composition T (d)              +0.668       +0.674       +0.551
Local composition C (d)              +0.021       +0.039       +0.269
Local composition G (d)              -0.149       +0.211       +0.246
Global composition AT (e)            +0.052       -0.154       -0.219

(a) Lengths are in the 2nd column of Table I. 
(b) Gene density data are taken from Ref. 
(c) Inter- and intra-chromosomal duplication data are taken from Ref. 
(d) Local base compositions are shown in the 8th-11th columns of Table I for the first 10 Mbases. 
(e) Data for global base composition are in Table I. 

FIG. 3: Scheme of the first 30 bases of the sequence of human chromosome 1. From top to bottom: 
nucleotide sequence; the corresponding +1 and -1 sequence; the DNA walk $y(x)$ obtained by 
summing the increments +1 and -1. 

FIG. 4: Base composition (% of ATCG nucleotides) of the clusters in the human chromosomes 1, 
2, 3, 4, 5, 6. For each chromosome, the plots refer to windows $n = 2$, $n = 4$, $n = 10$. 

FIG. 5: Same as Fig. 4 but for the chromosomes 7, 8, 9, 10, 11, 12. 

FIG. 6: Same as Fig. 4 but for the chromosomes 13, 14, 15, 16, 17, 18. 

FIG. 7: Same as Fig. 4 but for the chromosomes 19, 20, 21, 22, X, Y. 